
Online ASR with Emformer RNN-T

Author: Jeff Hwang, Moto Hira

This tutorial shows how to use Emformer RNN-T and the streaming API to perform online speech recognition.

1. Overview

Online speech recognition consists of the following steps:

  1. Build the inference pipeline. Emformer RNN-T is composed of three components: feature extractor, decoder, and token processor.

  2. Format the waveform into chunks of expected sizes.

  3. Pass data through the pipeline.
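Step 2 above can be sketched with plain Python lists. `split_into_chunks` is a hypothetical helper written for illustration (it is not part of torchaudio); the chunk size of 2560 samples matches the bundle used later in this tutorial, and the tail chunk is zero-padded to the full size:

```python
def split_into_chunks(waveform, frames_per_chunk):
    """Split a 1-D list of samples into equal-sized chunks,
    zero-padding the last chunk up to frames_per_chunk."""
    chunks = []
    for start in range(0, len(waveform), frames_per_chunk):
        chunk = waveform[start : start + frames_per_chunk]
        if len(chunk) < frames_per_chunk:
            chunk = chunk + [0.0] * (frames_per_chunk - len(chunk))
        chunks.append(chunk)
    return chunks


samples = [0.1] * 6000  # a dummy waveform of 6000 samples
chunks = split_into_chunks(samples, 2560)
print(len(chunks))            # 3 chunks: 2560 + 2560 + 880 (padded to 2560)
print([len(c) for c in chunks])
```

In the actual pipeline below, this chunking is handled by the Streamer, which yields fixed-size chunks directly.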

2. Preparation

Note

The streaming API requires FFmpeg libraries (>=4.1).

If you are using the Anaconda Python distribution, conda install -c anaconda ffmpeg will install the required libraries.

When running this tutorial in Google Colab, the following commands should suffice.

!add-apt-repository -y ppa:savoury1/ffmpeg4
!apt-get -qq install -y ffmpeg
import IPython
import torch
import torchaudio

print(torch.__version__)
print(torchaudio.__version__)

from torchaudio.prototype.io import Streamer

Out:

1.12.0.dev20220209+cpu
0.11.0.dev20220209+cpu

3. Construct the pipeline

Pre-trained model weights and related pipeline components are bundled as torchaudio.pipelines.RNNTBundle.

We use torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH, which is an Emformer RNN-T model trained on the LibriSpeech dataset.

bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH

feature_extractor = bundle.get_streaming_feature_extractor()
decoder = bundle.get_decoder()
token_processor = bundle.get_token_processor()

Out:

100%|##########| 3.81k/3.81k [00:00<00:00, 16.2MB/s]
Downloading: "https://download.pytorch.org/torchaudio/models/emformer_rnnt_base_librispeech.pt" to /root/.cache/torch/hub/checkpoints/emformer_rnnt_base_librispeech.pt
100%|##########| 293M/293M [00:04<00:00, 74.8MB/s]
100%|##########| 295k/295k [00:00<00:00, 79.4MB/s]

Streaming inference works on input data with overlap: Emformer RNN-T expects each segment to be accompanied by right context, as illustrated below.

https://download.pytorch.org/torchaudio/tutorial-assets/emformer_rnnt_context.png

The size of the main segment and right context, along with the expected sample rate, can be retrieved from the bundle.

sample_rate = bundle.sample_rate
frames_per_chunk = bundle.segment_length * bundle.hop_length
right_context_size = bundle.right_context_length * bundle.hop_length

print(f"Sample rate: {sample_rate}")
print(f"Main segment: {frames_per_chunk} frames ({frames_per_chunk / sample_rate} seconds)")
print(f"Right context: {right_context_size} frames ({right_context_size / sample_rate} seconds)")

Out:

Sample rate: 16000
Main segment: 2560 frames (0.16 seconds)
Right context: 640 frames (0.04 seconds)
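The printed values follow from simple arithmetic. The bundle parameters used below (segment_length=16, right_context_length=4, hop_length=160) are inferred from the output above for this particular bundle; hop_length is the number of waveform samples per feature frame (10 ms at 16 kHz):

```python
# Reproduce the arithmetic behind the printed values.
sample_rate = 16000
segment_length = 16        # feature frames per main segment
right_context_length = 4   # feature frames of right context
hop_length = 160           # waveform samples per feature frame (10 ms at 16 kHz)

frames_per_chunk = segment_length * hop_length
right_context_size = right_context_length * hop_length

print(frames_per_chunk, frames_per_chunk / sample_rate)      # 2560 samples, 0.16 s
print(right_context_size, right_context_size / sample_rate)  # 640 samples, 0.04 s
```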

4. Configure the audio stream

Next, we configure the input audio stream using Streamer().

For details on this API, please refer to the Media Stream API tutorial.

The following audio file was originally published by the LibriVox project, and it is in the public domain.

https://librivox.org/great-pirate-stories-by-joseph-lewis-french/

It was re-uploaded for the sake of this tutorial.

src = "https://download.pytorch.org/torchaudio/tutorial-assets/greatpiratestories_00_various.mp3"

streamer = Streamer(src)
streamer.add_basic_audio_stream(frames_per_chunk=frames_per_chunk, sample_rate=bundle.sample_rate)

print(streamer.get_src_stream_info(0))
print(streamer.get_out_stream_info(0))

Out:

SourceAudioStream(media_type='audio', codec='mp3', codec_long_name='MP3 (MPEG audio layer 3)', format='fltp', bit_rate=128000, sample_rate=44100.0, num_channels=2)
OutputStream(source_index=0, filter_description='aresample=16000,aformat=sample_fmts=fltp')

Streamer iterates over the source media without overlap, so we create a helper class that caches the current chunk and returns it with the right context appended once the next chunk arrives.

class ContextCacher:
    """Cache the previous chunk and combine it with the new chunk

    Args:
        chunk (torch.Tensor): Initial chunk
        right_context_size (int): The size of right context.
    """

    def __init__(self, chunk: torch.Tensor, right_context_size: int):
        self.chunk = chunk
        self.right_context_size = right_context_size

    def __call__(self, chunk: torch.Tensor):
        right_context = chunk[: self.right_context_size, :]
        chunk_with_context = torch.cat((self.chunk, right_context))
        self.chunk = chunk
        return chunk_with_context
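To make the caching behavior concrete, here is a standalone sketch of the same logic using plain lists instead of tensors (so it runs without torch); each "frame" is just a number:

```python
class ContextCacherSketch:
    """List-based analogue of ContextCacher: returns the cached chunk with
    the first `right_context_size` frames of the new chunk appended."""

    def __init__(self, chunk, right_context_size):
        self.chunk = chunk
        self.right_context_size = right_context_size

    def __call__(self, chunk):
        right_context = chunk[: self.right_context_size]
        chunk_with_context = self.chunk + right_context
        self.chunk = chunk  # cache the new chunk for the next call
        return chunk_with_context


cacher = ContextCacherSketch(chunk=[1, 2, 3, 4], right_context_size=2)
print(cacher([5, 6, 7, 8]))     # [1, 2, 3, 4, 5, 6] -> chunk 1 plus right context from chunk 2
print(cacher([9, 10, 11, 12]))  # [5, 6, 7, 8, 9, 10] -> chunk 2 plus right context from chunk 3
```

Note that the cacher is seeded with the first chunk, so every call returns the previous chunk with fresh right context; this mirrors how the real pipeline below seeds ContextCacher with `next(stream_iterator)`.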

5. Run stream inference

Finally, we run the recognition.

First, we initialize the stream iterator and the context cacher, along with the state and hypothesis that the decoder uses to carry over the decoding state between inference calls.

stream_iterator = streamer.stream()
cacher = ContextCacher(next(stream_iterator)[0], right_context_size)

state, hypothesis = None, None

Next, we run the inference.

For the sake of better display, we create a helper function that processes the source stream for a given number of iterations, and we call it repeatedly.

@torch.inference_mode()
def run_inference(num_iter=200):
    global state, hypothesis
    chunks = []
    for i, (chunk,) in enumerate(stream_iterator, start=1):
        segment = cacher(chunk).T[0]
        features, length = feature_extractor(segment)
        hypos, state = decoder.infer(features, length, 10, state=state, hypothesis=hypothesis)
        hypothesis = hypos[0]
        transcript = token_processor(hypothesis.tokens, lstrip=False)
        print(transcript, end="", flush=True)

        chunks.append(chunk)
        if i == num_iter:
            break

    return IPython.display.Audio(torch.cat(chunks).T.numpy(), rate=bundle.sample_rate)

run_inference()

Out:

forward great pirate's this is aver vice recordings are in the public domain for more information or please visitor recording by james christoper great pirite stories by various eded by joseph's fordie emboys the romance of the sea in its highest expression it is a sad but inevable commentary on our civilization that so far as the sea is concerned it


run_inference()

Out:

is developed from its infancy down to a century or so ago under one phase or another of pircy if men were savages on land they were doubly so at sea and all the years oftime adventure years that added to theap world there was little left to discover could not wholly eradic theat germ it went out gradually with the settlement and ordering of the far british colonies great britain foremost of sea powers must bered doing more both directly and indirectorally for the abolition of crime and disord on the high seas than any other


run_inference()

Out:

force but the conquest was not complete till the steam which chased the rover into the furthesters of his domain it is said that he survives even to day in certain spots in the chines but he is certainly a pir of any sort would be as great a curiosity to day if he could be caught and exhibited as a fab the fact remains and will always persist that lore of the sea is far away the most picturesque figure in the more genuine gross his character the higher degree of interest as he inspire there


run_inference()

Out:

may be a certain perversity in this for the pirate was unquestionably a bad man at his best or worst considering his surroundings and conditions undoubtily the worst man that ever lived there is little to soften the dark yet glowing picture of his exploits but again it must be remembered that only does the note of distant subdue and even lend a certain enchant to the scene but the effective contrast between our peaceful times andributes much to deepen our interest in him perhaps it is this latter added to that death wasp on the human breast that


run_inference()

Out:

gloves at the tale which makes them the kind of hero of romance that is to day he is undially a redoubtable historical figure it is a curious fact that commer seas cradled in the lap of bucci the constant danger of thes in this form only made heartier mariners out of the merchant adventurers actually stimating and strengthening marits bucc is only a polite for piry thus became the high romance of theas during the great centuries oftime adventure it went in hand with discovery


run_inference()

Out:

they were in fact almost inseparable most of the mighty mariners from the days of the discoverer through those of the redoubtable sir francis drake down to her own jones answered to the roll it was a bold hearty world this avarice up to the advent of our giant's steam every foot of which was won my fierce conquest of one sort or another out of this passed the pir emerges are romantic even at times heroic figure this final nic despite his crimes cannot altogether be denied a hero he is and will remain


run_inference()

Out:

so long as tales of theer told so have at him in these pages joseth lewherents and of recording james christopher jist christopher at yucha come


Total running time of the script: ( 1 minutes 1.579 seconds)

